Introduction



Advanced Memory Protection (AMP) consists of memory features that provide increased tolerance and 
protection from memory failures. There are varying levels of AMP that are supported on ProLiant 
servers, depending on the class of server. Refer to the product QuickSpecs for specific information on 
the level of features supported on each ProLiant server. 

AMP features include Advanced ECC, Online Spare Memory, Memory Mirroring, and RAID. 
Advanced ECC and Online Spare are supported on 300 series platforms. This whitepaper details 
Advanced ECC and Online Spare support for the 300 series platforms: how these features are 
enabled, the configuration rules for using them, which utilities can be used to monitor failures, 
and how failures can be repaired. 

Memory failures defined 

Memory failures vary in severity and in their impact on the state of the server. 
Memory errors can be classified into correctable errors and uncorrectable errors. 

Correctable errors can be detected and corrected if the chipset and DIMM support this functionality. 
Error detection and correction is implemented by storing data and ECC bits on the DIMM. By utilizing 
the data and ECC bits, the system can detect memory errors and correct certain types of failures. 
Correctable errors are generally single-bit errors. All ProLiant 300-series servers are capable of 
detecting and correcting single-bit errors. In addition, ProLiant servers with Advanced ECC support 
can detect and correct some multi-bit errors. HP's Advanced ECC allows detection and correction of 
multi-bit failures if all failed bits are contained within a single DRAM device on the DIMM. 

Correctable errors can be classified as "hard" and "soft" errors. With a hard error, every access to 
the memory location will return an error. A hard error typically indicates a problem with the DIMM. 
With a soft error, the data and/or ECC bits on the DIMM are incorrect, but the error will not continue 
to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors are typically 
caused by cosmic rays. They are rare but expected occurrences. 

Although hard correctable memory errors are corrected by the system and will not result in system 
downtime or data corruption, they indicate a problem with the hardware. On the other hand, soft 
errors do not indicate any issue with the hardware. Due to this, HP ProLiant servers track the rate of 
correctable errors through correctable error thresholding. This allows the system to differentiate 
between hard and soft errors. A soft error will not typically cause a DIMM to exceed HP's correctable 
error threshold. On the other hand, a hard error will typically cause a DIMM to exceed HP's 
correctable error threshold. Due to HP's correctable error thresholding, the user is warned about hard 
correctable errors, but is not notified about soft errors which don't indicate any issue with the 
hardware. HP suggests that corrective action be taken if a DIMM is receiving correctable errors at a 
rate higher than HP's correctable error threshold rate. Even though a DIMM has exceeded the 
correctable threshold, future errors will continue to be corrected. The system will not shut down or 
crash due to additional correctable errors. However, a DIMM that is receiving correctable errors at a 
high rate has a higher probability of receiving an uncorrectable error, which would result in a system 
crash or shutdown for systems not configured for the Mirroring or RAID AMP modes. 
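
The thresholding idea described above can be sketched as a sliding-window counter. This is only an illustration: the class name, the window length, and the error limit are hypothetical, and HP's actual threshold values and algorithm are not given in this paper.

```python
from collections import deque


class CorrectableErrorTracker:
    """Counts correctable errors for one DIMM inside a sliding time window.

    max_errors and window_seconds are illustrative assumptions, not HP values.
    """

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one correctable error; return True if the threshold is exceeded."""
        self._timestamps.append(timestamp)
        # Forget errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


# Rare, widely spaced soft errors never accumulate inside one window.
soft = CorrectableErrorTracker(max_errors=3, window_seconds=3600)
soft_results = [soft.record(t) for t in (0, 100_000, 200_000)]

# A hard error repeats on every access, so the count climbs quickly.
hard = CorrectableErrorTracker(max_errors=3, window_seconds=3600)
hard_results = [hard.record(t) for t in range(10)]
```

In this model the isolated soft errors never trip the threshold, while the repeating hard error trips it on the fourth occurrence, which mirrors how rate tracking separates the two cases.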

The user is warned about a DIMM exceeding the correctable error threshold in multiple ways. The 
system's internal Health LED will indicate a caution condition. On most ProLiant 300-series servers, an 
LED next to the DIMM exceeding the threshold will be illuminated. In addition, if the System 
Management Driver and agents are loaded, a message will be logged to both the console and 
Systems Insight Manager. Correctable memory errors can typically be isolated to the actual failed 
DIMM. 

While correctable errors do not affect the normal operation of the system, uncorrectable memory 
errors will immediately result in a system crash or shutdown when the system is not configured for 
Mirroring or RAID AMP modes. Uncorrectable errors are detected by ProLiant 300-series servers, but 
cannot be corrected. ProLiant 500-series and 700-series platforms with Mirroring or RAID AMP 
support are capable of protecting against uncorrectable memory errors. Uncorrectable errors are 
always multi-bit memory errors. For systems with Advanced ECC support, multi-bit memory errors 
within the same DRAM device on the DIMM are not uncorrectable. However, if multiple bits are failed 
on different DRAM devices on a DIMM, the error will be uncorrectable. When a system receives an 
uncorrectable error and is not in an AMP mode providing protection against these errors, the system 
will generate a non-maskable interrupt (NMI). The internal Health LED will indicate a critical 
condition, and on most systems, the LEDs next 
to the failed DIMMs will be illuminated. In addition, the error will be logged if the Systems 
Management Driver is loaded. In certain cases (typically when the failed memory is in the first Bank of 
memory), the NMI handler will be incapable of running because the memory where the NMI handler 
resides will be corrupted. In these cases, the system will typically hard lock without any additional 
indication regarding the failure. Uncorrectable memory errors can typically only be isolated down to 
a failed Bank of DIMMs, rather than the DIMM itself. 

Protection from memory failures 

There are six levels of protection from memory errors that are supported by HP. In this whitepaper, the 
focus will be on those levels of protection supported by the 300-series G4 class of servers. Each level 
of protection requires server support. 

The base level of memory protection available is parity protection. All ProLiant 300-series platforms 
provide memory protection beyond that provided by parity. Parity can detect when a single-bit error 
occurs, but cannot correct it. When a single-bit error occurs on a system with parity protection, the 
system will generate a non-maskable interrupt (NMI) and hard lock. Thus, single-bit errors are uncorrectable 
errors on a system with parity protection. In parity mode, there is no protection from any level of 
memory failures because the ability to correct the failure does not exist. 
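
As a minimal sketch of why parity can detect but not correct, the following example stores one even-parity bit alongside a data word. The function names and the 8-bit word are illustrative assumptions; real memory subsystems use wider words.

```python
def parity_bit(data):
    """Return the even-parity bit for the binary representation of data."""
    return bin(data).count("1") % 2


def parity_ok(data, stored_parity):
    """True if the data still matches the parity bit stored with it."""
    return parity_bit(data) == stored_parity


word = 0b10110010
stored = parity_bit(word)

# A single inverted bit changes the parity, so the error is detected.
single_flip = word ^ 0b00000100
detected = not parity_ok(single_flip, stored)
```

The check only reports that the word no longer matches its parity bit. It carries no information about which bit changed, so the original data cannot be reconstructed.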

The next level of protection is Standard ECC. Standard ECC requires chipset and DIMM level support 
and provides the capability to detect and correct a single-bit error on a memory access. When a 
single-bit error occurs, the system will detect the error and correct the data. Thus, the system will 
continue to operate normally. With Standard ECC, all multi-bit memory errors will be detected, but not 
corrected. Multi-bit errors are uncorrectable and will result in a system crash and NMI. 
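
The single-bit-correct, double-bit-detect behavior of Standard ECC can be illustrated with a small extended Hamming code. This is a textbook sketch protecting a 4-bit value with 4 check bits; it is not the code used by any particular chipset, and real ECC DIMMs protect much wider words. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data) where status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity means a single flipped bit; the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome means two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


stored = encode(0b1011)

one_bit = list(stored)
one_bit[5] ^= 1

two_bits = list(stored)
two_bits[2] ^= 1
two_bits[6] ^= 1
```

Decoding `one_bit` recovers the original value, while decoding `two_bits` reports an uncorrectable error, which corresponds to the crash and NMI case described above.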

A more robust level of protection is provided by Advanced ECC, also known in the industry as 
"Chipkill." Advanced ECC requires chipset and DIMM support and provides a higher level of 
protection over Standard ECC. Like Standard ECC, Advanced ECC will detect and correct single-bit 
errors. However, Advanced ECC will also detect and correct multi-bit errors if all failed bits are within 
a single DRAM device on the DIMM. An entire DRAM device on the DIMM can be failed, and the 
system will continue to operate normally. If there are multiple bits of failure that occur on multiple 
DRAM devices on the DIMM, the error cannot be corrected with Advanced ECC support, and the 
system will crash and NMI. 
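
The difference between the protection levels can be summarized as a small classification routine. This is a sketch under stated assumptions: the mode names are labels chosen for the example, and the mapping of bit positions to DRAM devices (contiguous groups of `bits_per_device` bits) is hypothetical, since the real mapping depends on the DIMM and chipset.

```python
def classify_error(failed_bits, mode="advanced_ecc", bits_per_device=4):
    """Classify an error given the bit positions in one memory word that read back wrong."""
    bits = set(failed_bits)
    if not bits:
        return "no error"
    if len(bits) == 1:
        # Parity detects a single-bit error but cannot correct it.
        return "uncorrectable" if mode == "parity" else "correctable"
    if mode == "advanced_ecc":
        # Assumed layout: each DRAM device supplies a contiguous group of bits.
        devices = {bit // bits_per_device for bit in bits}
        if len(devices) == 1:
            return "correctable"
    return "uncorrectable"
```

For example, `classify_error([4, 5, 7])` falls within one assumed device and is correctable under Advanced ECC, whereas `classify_error([3, 4])` spans two devices and is not.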

HP offers memory protection beyond those features listed above. ProLiant 300-series servers support 
Online Spare Mode. With Online Spare enabled, the system still takes advantage of Advanced ECC. 
In Online Spare Mode, one bank of memory is designated as the spare bank. In this mode, the 
designated bank is not counted toward total available system memory. If the correctable error threshold is 
exceeded by a DIMM in a particular bank of memory, that bank will be taken offline and the spare 
bank activated instead. Once the original bank is deactivated, the system will not utilize the memory 
that exhibited the failure. After switching to the spare bank of memory, the system will continue to 
monitor correctable threshold errors and log any failures. If an uncorrectable memory error occurs 
before or after the online spare switchover, the system will crash and NMI. However, the memory 
which exceeded the correctable threshold and was deactivated cannot result in an uncorrectable error 
once the online spare switchover is complete. 

Benefits of online spare memory 

With Online Spare Memory, degraded memory is automatically disengaged and a fresh set of 
memory is used in its place. This brings the reliability of the system to the pre-failure level without any 
service interruption and without compromising system availability. 

This solution is beneficial to businesses that do not have a permanent IT staff, do not have 
replacement memory on hand, or cannot bring down the server for any reason until a scheduled 
downtime. If a memory module has reached its pre-defined threshold of correctable memory errors, 
its chance of encountering uncorrectable errors increases dramatically. Online Spare allows the 
system to automatically deactivate memory that is at a high risk of receiving an uncorrectable error, 
and replace it with good memory. No interruption to system operation occurs. An uncorrectable error 
would result in a system crash and unscheduled downtime. Thus, Online Spare Mode decreases the 
chances of unscheduled downtime and system crashes due to uncorrectable memory errors. 

Online Spare Memory is a higher level of memory protection that complements Advanced ECC 
support. Online Spare Memory is a user selectable option. Users can choose to disable Online Spare 
and make all installed memory available to the operating system and applications, or they can 
choose to enable Online Spare and reduce the amount of memory available to the OS and 
applications in return for a higher level of protection against uncorrectable memory errors. By default, 
Online Spare Mode is disabled. 

Deployment considerations 

There are a few key factors to consider when determining what level of AMP support should be 
enabled: 

• What features are supported on the ProLiant server being deployed? 

• What level of protection is desired? 

• What is the cost of implementing the AMP mode? 

To determine what AMP features are supported on your ProLiant server, refer to the Product 
QuickSpecs. The above sections detail the various protection modes and the benefits of each. The 
cost of implementation for Online Spare over Advanced ECC is the hardware cost of the extra DIMMs 
required for the spare bank. If Standard ECC or Advanced ECC is implemented, there is no cost 
associated with extra hardware. 

Implementation differences between ProLiant 300 series G3 
and G4 servers 

The implementation of AMP support for G4 300-series ProLiant servers is very similar to the 
implementation on G3 servers with the following exceptions: 

• The configuration rules have changed with regard to dual-ranked DIMMs (see "Configuration rules," 
below). 

• Memory will automatically be tested in POST whenever the memory configuration has changed (see 
"Configuring AMP," below). 

• RBSU will now allow the system to be configured for Online Spare, even when the DIMM 
configuration is invalid (see "Configuring AMP," below). 

• Correctable threshold errors are still monitored after an Online Spare switchover. On the previous 
generation (G3) systems, after an Online Spare switchover had occurred, the next time any DIMM 
exceeded its correctable threshold, it would be detected and reported. After that, though, if any 
more DIMMs exceeded their correctable threshold on a G3 system, this would not be detected by 
the system and would not be reported (see "Managing Failures in a system with online spare 
enabled," below). G4 systems will continue to monitor correctable threshold errors until all banks of 
memory have exceeded the correctable error threshold. 

• Symmetric Memory Mode feature added (See "Symmetric Memory" section). 

• Background Scrubbing feature added (See "What is scrubbing?" section). 

• The Online Spare Bank is periodically tested during normal system operation. If an uncorrectable 
error is detected in the Online Spare Bank prior to an Online Spare switchover, the system will 
continue to operate normally, but the system will not switch to the Online Spare Bank if a DIMM 
exceeds the correctable error threshold. This prevents the system from switching over to memory that 
would result in an uncorrectable memory error and subsequent crash. 

There is no additional software required to enable AMP features. Previous generations (G2) required 
the system management (health) driver to be loaded. The health driver is not required in 300 series 
G4 servers but is required to provide console messaging and messaging with Systems Insight 
Manager. 

Dual rank vs. single rank DIMMs 

DIMMs can be classified as single- or dual-rank. Single rank and dual rank DIMMs (also known as 
single and double-sided DIMMs) may be the same capacity, but are not equivalent in all cases. For 
instance, many ProLiant servers require installing DIMMs in pairs. In this instance, a 1 GB single-rank 
DIMM is not equivalent to a 1 GB dual-rank DIMM. Also, single- and dual-rank DIMMs of the same 
capacity may not be equivalent for certain population rules in AMP modes. On some ProLiant 
products, combinations of these will work and on other products they may not. There may also be 
certain configuration rules to allow combinations of single and dual-rank DIMMs to operate together. 
Refer to the product QuickSpecs for specific details on what DIMMs are supported on your server. 

Configuration rules for online spare 

The configuration rules for Online Spare are very simple. If these rules are not followed and the 
system is configured for Online Spare mode, the system will automatically boot into Advanced ECC 
mode. On each reboot, the system will attempt to enter Online Spare mode as long as the system is 
configured for Online Spare mode. A warning message will be displayed at POST and logged to the 
Integrated Management Log (IML). When the user configures the system for Online Spare mode, the 
system will remain configured for this mode even if the system boots in Advanced ECC mode. 

The general configuration rules for Advanced Memory Protection are: 

• The banks are designated A, B, C, and D. 

• Memory must be populated sequentially, starting with bank A. 

• DIMMs must be the same capacity and have the same number of ranks within a bank. 

• On systems that support DDRII memory, dual-rank DIMMs are not supported. 

• On systems that support DDRI DIMMs, dual-rank DIMMs must be populated after single-rank 
DIMMs. 

• The last bank populated is the Online Spare bank. 

• The Online Spare bank must have equal or larger capacity DIMMs to all other DIMMs in the 
system. See "Note," below. 



Note 

This simple rule is true with HP Memory Option kits. For third party 
memory, the true requirement is that each rank of the Online Spare bank 
must have at least as much memory as each rank of every other bank. 
If a dual-rank Online Spare bank is used, and another bank in 
the system is populated with single-rank DIMMs, each rank of the Online 
Spare bank must be at least as large as the rank in the single-rank DIMM. 
To simplify, if single and dual-rank DIMMs are mixed in a system, Online 
Spare can only be supported if the Online Spare bank is at least twice as 
large as the bank with single-rank DIMMs. HP Memory Option kits are 
configured such that the previous simple rule can be followed. 



• The system will attempt to boot in Online Spare with every reboot and will downgrade to Advanced 
ECC if the configuration rules are not met. 
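
The rules above lend themselves to a simple pre-installation check. The following is a sketch only: the function name, the dictionary layout, and the `(capacity_mb, ranks)` tuple representation are assumptions made for this example, and it applies the simple capacity rule for HP Memory Option kits rather than the per-rank rule for third party memory.

```python
BANK_ORDER = ["A", "B", "C", "D"]


def check_online_spare(config, memory_type="DDR2"):
    """Return a list of rule violations; an empty list means the layout qualifies.

    config maps a bank letter to a list of (capacity_mb, ranks) tuples, one per DIMM.
    """
    problems = []
    populated = [bank for bank in BANK_ORDER if config.get(bank)]
    if populated != BANK_ORDER[:len(populated)]:
        problems.append("banks must be populated sequentially, starting with bank A")
    if len(populated) < 2:
        # Implied rather than stated: one bank must stay in use besides the spare.
        problems.append("at least two populated banks are needed")
        return problems
    for bank in populated:
        if len(set(config[bank])) > 1:
            problems.append(f"bank {bank}: DIMMs differ in capacity or rank count")
    ranks = [config[bank][0][1] for bank in populated]
    if memory_type == "DDR2" and 2 in ranks:
        problems.append("dual-rank DIMMs are not supported on DDR2 systems")
    if memory_type == "DDR1" and ranks != sorted(ranks):
        problems.append("dual-rank DIMMs must be populated after single-rank DIMMs")
    spare = populated[-1]  # the last populated bank is the Online Spare bank
    smallest_spare = min(capacity for capacity, _ in config[spare])
    largest_other = max(
        capacity for bank in populated[:-1] for capacity, _ in config[bank]
    )
    if smallest_spare < largest_other:
        problems.append(f"spare bank {spare} DIMMs are smaller than another DIMM")
    return problems


valid_layout = {"A": [(512, 1), (512, 1)], "B": [(1024, 1), (1024, 1)]}
small_spare = {"A": [(1024, 1), (1024, 1)], "B": [(512, 1), (512, 1)]}
```

Here `check_online_spare(valid_layout)` returns an empty list, while `check_online_spare(small_spare)` reports that the spare bank is too small. On a real server, a failed check corresponds to the system booting in Advanced ECC mode with a POST warning.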

Configuring the AMP mode 

Configuring the AMP mode requires very little work once the user determines the desired mode. The 
desired AMP mode is selected via ROM-Based Setup Utility (RBSU). It is recommended that the 
system's DIMM configuration be set up properly to support the desired AMP mode prior to running 
RBSU to enable the desired AMP mode. 

Also, it is highly recommended that the memory test is run with the system configured for Advanced 
ECC once the memory is added to the system. This helps to verify that the memory in the online spare 
bank has been tested and is working properly. The spare memory is not fully tested again after the 
system is configured for Online Spare and will not be utilized until the switchover occurs between the 
primary bank and the designated spare bank. With ProLiant 300-series G4 platforms, a very basic 
verification of the Online Spare bank is periodically run during normal operation. However, it is still 
recommended that a more thorough test be run on the spare bank prior to configuring the system for 
Online Spare Mode. The POST memory test can be used for this purpose. 

The POST memory test will automatically be executed in POST any time the memory configuration is 
changed. When memory is replaced with DIMMs of the same capacity, a memory configuration 
change will not be detected, so the user should use the following steps to manually test the 
memory: 

1. Under Advanced Options in RBSU, set POST Speed Up to Disabled (it is enabled by 
default). 

2. Make sure that the selected AMP Mode is Advanced ECC (this is the default). This option is also in 
RBSU under Advanced Options and Advanced Memory Protection. 

3. Reboot. All the memory will be tested. This may take a few minutes, depending on how much 
memory is installed in your system. Once the memory has been tested, you can enable POST Speed 
Up again for faster system boot. 

Another option for verifying the Online Spare bank is to run the ROM-based Diagnostics Memory Test 
when the system is configured for Advanced ECC Mode. The ROM-based Diagnostics Memory Test is 
accessed from the System Maintenance menu, which is entered by pressing the F10 key at the 
prompt late in POST. Using this memory test spares the user from having to enter RBSU to 
disable POST Speed Up, reboot to allow POST to test the memory, and then enter RBSU again to 
enable POST Speed Up. However, just as with the 
POST memory test, the system must be configured for Advanced ECC to allow the Online Spare bank 
to be tested. 

Once the memory has been tested, enable Online Spare: 

1. At the prompt at the end of POST, press the F9 key to enter RBSU. 

2. From the RBSU main menu, select System Options. 

3. Using the arrow key, select Advanced Memory Protection. 

4. To activate Online Spare Memory, highlight Online Spare and press the Enter key. Online Spare 
is selected as soon as Enter is pressed. (The default option is Advanced ECC, providing maximum 
memory size for applications that require a large memory footprint.) 

5. Press the F10 key to exit RBSU and the server will automatically reboot. Upon reboot, the system 
will be in Online Spare mode if supported by the installed DIMM configuration. 

As the server reboots subsequent to enabling Online Spare Memory, the following message displays: 

Advanced Memory Protection Mode: Online Spare with Advanced ECC xxxxMB System Memory and 
xxxxMB memory reserved for Online Spare 
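
To show how the two figures in this message relate to the installed memory, the sketch below computes them from per-bank totals, treating the last populated bank as the spare. The function name and input format are hypothetical, and the exact number formatting used by the system ROM may differ.

```python
def post_amp_message(bank_sizes_mb):
    """Build the POST line from per-bank totals in MB, listed in population order.

    The last entry is treated as the Online Spare bank and excluded from system memory.
    """
    system_mb = sum(bank_sizes_mb[:-1])
    spare_mb = bank_sizes_mb[-1]
    return (
        "Advanced Memory Protection Mode: Online Spare with Advanced ECC "
        f"{system_mb}MB System Memory and "
        f"{spare_mb}MB memory reserved for Online Spare"
    )


message = post_amp_message([1024, 1024, 2048])
```

With banks of 1024 MB, 1024 MB, and 2048 MB, half of the installed memory is reserved, which illustrates the capacity cost of enabling Online Spare.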



Note 

RBSU will allow the user to configure an AMP mode that the current 
DIMM configuration does not support, but it will display a warning 
message when making the selection. If the user enables an AMP mode not 
supported by the current DIMM configuration, the system will boot in 
Advanced ECC mode (though the system will still be configured for Online 
Spare mode) on the next reboot and an error message will be displayed 
during POST indicating that the desired AMP mode is not supported by the 
current DIMM configuration. 

Online Spare mode does not require operating system support. It can be enabled and function 
properly with any operating system. In addition, no special software beyond the System BIOS is 
required for the proper function of Online Spare mode. However, if messaging and logging are 
desired at the console along with messages in Insight Manager, an operating system must be used 
that has system management and agent support for AMP. On ProLiant 300-series G4 servers, these 
operating systems include: 

• Microsoft® Windows® 2000 Server 

• Microsoft Windows 2000 Advanced Server 

• Microsoft Windows 2003 Server 

• Microsoft Windows 2003 Enterprise Server 

• Linux 

• Novell NetWare 6.x 

Managing failures in a system with online spare enabled 

When the correctable error threshold is exceeded for a DIMM with Online Spare enabled, the system 
will copy the contents of the failing bank to the online spare bank. The health driver will log a 
message to the console and to the event log indicating that a threshold has been exceeded and a 
copy has been initiated. If the agents are loaded, Systems Insight Manager will indicate the memory is 
in a degraded state. In addition, the following LEDs will indicate that the error has occurred: 

• The internal health LED will indicate caution. 

• The LED next to the DIMM exceeding the correctable error threshold will illuminate amber. 

• The Online Spare Status LED will illuminate amber, indicating an online spare copy has been 
initiated. 

After the copy is completed, the failing bank will be deactivated and the Online Spare bank will 
become active. The health driver will log a message to the console and to the event log to indicate 
that the copy is complete. 

As mentioned previously, the ProLiant 300-series G4 servers periodically verify the online spare bank 
during normal operation. If a potential uncorrectable error in the spare bank is detected, support for 
switching over to the spare bank will be disabled. The health driver will log a message to the console 
and to the IML. In addition, the internal health LED will indicate a degraded state, the Online Spare 
Status LED will illuminate amber, and the DIMM LEDs for the spare DIMMs will illuminate amber. If a 
potential uncorrectable error is detected in the online spare bank prior to an online spare switchover, 
the system will continue to operate normally, just without the protection of Online Spare mode. The 
system will not crash or NMI. 

A system in Online Spare mode does not have full protection from uncorrectable errors since Online 
Spare does not provide this level of protection. An uncorrectable memory error will result in a system 
crash and NMI. However, a system in Online Spare mode has a reduction in the probability of 
receiving an uncorrectable error because DIMMs that are exceeding the correctable error threshold, 
and thus at higher risk of receiving an uncorrectable error, are deactivated. Once the switchover to 
the spare bank has occurred, or once switchover support has been disabled because the system 
detected a potential uncorrectable error in the inactive spare bank, the system no longer has a 
spare bank to switch over to if the correctable error threshold is exceeded on another bank. Such 
an event is then treated as if the system were in Advanced ECC mode. The health driver will 
report the event and the failed DIMM's LED will 
illuminate, but no switchover will occur and the failing memory will remain active. The system can be 
powered off at the user's convenience to replace the failed DIMMs. It is important to note that with 
each reboot, the system will continue to attempt to boot off the original memory. It is expected in this 
case that the memory will fail again at some point and the spare bank will again become active. 
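
The failure-handling behavior described in this section can be sketched as a small state model. The class and method names are hypothetical, the logged strings are illustrative rather than actual health driver messages, and the model does not cover reboots (after which the system attempts to use the original memory again).

```python
class OnlineSpareModel:
    """Toy model of Online Spare switchover for a list of populated banks."""

    def __init__(self, banks):
        self.active = list(banks[:-1])   # banks in use by the system
        self.spare = banks[-1]           # last populated bank is the spare
        self.spare_available = True
        self.events = []

    def spare_bank_fault(self):
        """Periodic spare verification found a potential uncorrectable error."""
        self.spare_available = False
        self.events.append("switchover disabled: spare bank failed verification")

    def threshold_exceeded(self, bank):
        """A DIMM in an active bank exceeded the correctable error threshold."""
        if bank not in self.active:
            raise ValueError(f"bank {bank} is not active")
        if self.spare_available:
            self.events.append(f"copying bank {bank} to spare bank {self.spare}")
            self.active[self.active.index(bank)] = self.spare
            self.spare_available = False
            self.events.append(f"copy complete: bank {bank} deactivated")
            return "switched"
        # No usable spare remains, so the event is handled as in Advanced ECC mode.
        self.events.append(f"bank {bank} degraded: no switchover, memory stays active")
        return "degraded"


system = OnlineSpareModel(["A", "B", "C"])
first = system.threshold_exceeded("A")    # spare bank C replaces bank A
second = system.threshold_exceeded("B")   # no spare left, bank B stays active
```

The first threshold event consumes the spare bank; the second is only reported, leaving the degraded bank in service until the DIMMs are replaced.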






Symmetric memory mode 



Symmetric Memory mode is a performance enhancement supported with certain DIMM configurations 
on the ProLiant 300-series G4 servers. With either four identical dual-rank DIMMs or eight identical 
single-rank DIMMs installed in a G4 system configured for Advanced ECC Mode, Symmetric Memory 
mode will automatically be enabled. When enabled, the system takes advantage of the particular 
DIMM configuration to improve performance. This mode cannot be entered when configured for 
Online Spare mode. 
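
The conditions for Symmetric Memory mode can be expressed as a short predicate. This is a sketch: the function name and mode label are hypothetical, and "identical" is modeled here as equal capacity and rank count, which may be narrower or broader than the check the system ROM actually performs.

```python
def symmetric_mode_enabled(dimms, amp_mode):
    """dimms is a list of (capacity_mb, ranks) tuples, one per installed DIMM."""
    if amp_mode != "advanced_ecc":
        return False                 # not available in Online Spare mode
    if len(set(dimms)) != 1:
        return False                 # DIMMs must all be identical
    ranks = dimms[0][1]
    return (ranks == 2 and len(dimms) == 4) or (ranks == 1 and len(dimms) == 8)
```

For instance, eight `(512, 1)` DIMMs in Advanced ECC mode satisfy the predicate, while the same DIMMs in Online Spare mode do not.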

Memory Scrubbing 
What is memory scrubbing? 

There are two types of memory scrubbing, demand and background. ProLiant 300-series G4 servers 
support both types of memory scrubbing. Previous generations of 300-series platforms typically 
supported only demand scrubbing. 

Demand scrubbing is a feature that allows the system to differentiate between soft and hard 
correctable memory errors. It allows the system to ignore soft errors while still notifying the customer of 
a true DIMM failure. When a correctable memory error occurs due to a soft error, correct data is 
written back to the memory device. This prevents the same soft error from occurring again and 
therefore from causing a DIMM to exceed the correctable error threshold. 

Background scrubbing reduces the chances that multiple soft errors will result in an uncorrectable 
memory error. With background scrubbing, the system is constantly correcting any potential soft 
errors "in the background." Without affecting normal system operation, the memory controller will 
continuously perform read/write operations on memory correcting any soft errors that may exist in 
memory. By doing this, a soft error that would result in a single-bit failure will likely be corrected 
before another soft error occurs that might result in multiple bits of corrupted data. If 
multiple bits of data are corrupted, an uncorrectable error could result, causing the system to crash 
(see description of Advanced ECC protection). 

See the section below for additional detail on how demand and background scrubbing work. 

Detailed description of memory scrubbing 

As mentioned previously, Advanced ECC and Standard ECC utilize data and ECC bits to perform 
memory error detection and correction. Through the data and ECC bits, the system can correct certain 
memory errors. For Standard ECC, the system can correct single-bit errors. For Advanced ECC, the 
system can correct single or multi-bit errors as long as all failed bits are on the same DRAM device on 
the DIMM. If the ECC bits are correct for the corresponding data, then no error occurs when a 
memory read occurs. However, if the data does not match the ECC bits, then an error occurs when a 
memory read occurs. In many cases, the proper data can be reconstructed through using the ECC 
check bits, resulting in a correctable error. The data and ECC bits are checked on memory reads to 
detect and potentially correct errors. The system writes correct data and ECC bits on memory writes. 

Also mentioned previously, soft errors are errors that occur in memory but which do not indicate a 
hardware problem. A cosmic ray hitting the DRAM device on a DIMM can in rare cases cause one or 
more bits to change states. This can result in a soft correctable or soft uncorrectable error. It is 
extremely rare to have multiple soft errors occur to the same memory location that would result in an 
uncorrectable error due to a soft error. Since soft memory errors don't indicate a problem with the 
hardware, and are simply the result of a bit in memory changing state due to cosmic rays, the 
memory error will only occur until the data and ECC bits are properly written back to the DIMM. 

The data and ECC bits are written to the DIMM on memory writes and checked on memory 
reads. A soft error could therefore result in multiple correctable memory errors if the processor 
continually read a memory location containing the soft error. If a write to that memory location 
occurred, the error would disappear. However, once the soft error leaves the data and ECC bits 
out of sync, every read of that memory location results in a correctable error until a 
write to that location occurs. In this way, a single soft error could cause a DIMM to exceed 
the correctable error threshold. 

Memory scrubbing is a method of solving this problem. There are two types of memory scrubbing 
supported by the ProLiant 300-series G4 platforms. The G4 systems and previous generations have 
supported something known as demand scrubbing. The G4 systems are the first ProLiant servers to 
support what is known as background or patrol scrubbing. 

Demand scrubbing solves the problem of obtaining multiple correctable errors due to a single soft 
error, and thus the problem of potentially reporting a correctable threshold error due to soft errors. 
Whenever the system detects a correctable error, the system will correct the data and pass the data to 
the requester, whether that be the processor or a DMA capable device. With demand scrubbing, the 
correct data and ECC check bits will also be written back to memory. In other words, when the system 
detects a correctable error via the data and ECC bits, it writes back the proper data and ECC bits to 
memory. Thus, subsequent reads of the same memory location will not result in a correctable error if 
the error was simply a soft error. If there was a hard error and something actually wrong with the 
DIMM, writing the correct data and ECC bits back to memory would typically not correct the problem, 
and additional correctable errors will occur on subsequent reads. 
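
The effect of demand scrubbing on error counts can be sketched with a toy model of a single memory location. The class and function names are hypothetical, and the model abstracts away the ECC logic itself: it only tracks whether a location currently holds a soft error or sits on a faulty cell.

```python
class Location:
    """One memory location that may hold a soft error or have a hard fault."""

    def __init__(self, soft_error=False, hard_fault=False):
        self.soft_error = soft_error
        self.hard_fault = hard_fault


def read(location, demand_scrub):
    """Perform one read; return True if it reports a correctable error."""
    error = location.soft_error or location.hard_fault
    if error and demand_scrub:
        # Writing corrected data back repairs a flipped bit, but not a bad cell.
        location.soft_error = False
    return error


def count_errors(location, reads, demand_scrub):
    """Count correctable errors reported over a number of repeated reads."""
    return sum(read(location, demand_scrub) for _ in range(reads))


without_scrub = count_errors(Location(soft_error=True), 100, demand_scrub=False)
with_scrub = count_errors(Location(soft_error=True), 100, demand_scrub=True)
hard_with_scrub = count_errors(Location(hard_fault=True), 100, demand_scrub=True)
```

In this model a single soft error produces 100 correctable errors without demand scrubbing but only one with it, while a hard fault keeps producing an error on every read. That difference is what lets the error threshold distinguish the two.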

Background scrubbing (also known as patrol scrubbing) is a very similar process. Instead of only 
reading the data and ECC bits, correcting them, and writing them back to memory when a 
correctable memory error occurs, the system will constantly be reading and writing memory locations. 
Thus, the system will be constantly scrubbing all of the contents of memory in an effort to correct soft 
errors before a correctable error even occurs. Even if a particular section of memory is not being 
accessed by software or DMA capable devices, background scrubbing will correct any soft errors that 
exist in the memory. Background scrubbing occurs at a very slow rate and only when the memory bus 
is available. Thus, the memory accesses due to background scrubbing do not affect normal system 
operation or system performance. If background scrubbing detects an uncorrectable memory error, it 
does not cause the system to crash or result in an NMI. 

Background scrubbing serves two purposes. First, it reduces the chances of the system receiving a 
correctable error on memory reads initiated by software or DMA-capable devices. While demand 
scrubbing prevents multiple correctable errors due to a soft error, one correctable error will occur on 
the initial memory read access. If a soft error occurs in memory (i.e., if a cosmic ray inverts a bit on a 
DRAM device), the background scrub may correct the error before any normal memory read occurs to 
the memory location that had been affected. Second and more importantly, background scrubbing 
reduces the chances of an uncorrectable error occurring due to a soft error. Although rare, it is 
possible that a portion of memory that is not being accessed for a long time could have multiple bit 
positions inverted by cosmic rays. For instance, if a bit in memory is inverted by a cosmic ray, but the 
memory is never read or written for a relatively long period of time, this would leave a window where 
an additional bit in the same memory location could be inverted by cosmic rays. In this case, multiple 
bits in the memory location could be inverted, which could potentially result in a system crash and 
NMI when the memory is read (in Advanced ECC, the system crash would occur if both inverted bits 
were not in the same DRAM device on the DIMM). 
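
The second benefit can be illustrated with a simple timeline model of one idle memory word. This is a sketch under stated assumptions: time is measured in arbitrary ticks, each flip is assumed to hit a different DRAM device (so two accumulated flips are uncorrectable even with Advanced ECC), and the scrub interval is illustrative rather than an actual hardware value.

```python
def final_read_status(flip_times, read_time, scrub_interval=None):
    """Status of the first software read of a word that sat idle until read_time.

    flip_times lists the ticks at which a bit of the word is inverted.
    scrub_interval is the patrol scrub period in ticks, or None for no scrubbing.
    """
    flipped = 0
    for tick in range(read_time + 1):
        scrub_now = scrub_interval and tick > 0 and tick % scrub_interval == 0
        if scrub_now and flipped == 1:
            flipped = 0  # the scrub corrects a lone flipped bit and writes it back
        flipped += flip_times.count(tick)
    if flipped == 0:
        return "ok"
    if flipped == 1:
        return "correctable"
    return "uncorrectable"


no_scrub = final_read_status([10, 500], read_time=1000)
patrol = final_read_status([10, 500], read_time=1000, scrub_interval=100)
```

Without scrubbing, both flips are still present at the read and the result is uncorrectable. With a scrub pass every 100 ticks, each flip is repaired before the next one arrives, so the read sees clean data.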